Near Duplicate Web Page Detection using NDupDet Algorithm
Authors
Abstract
The Web is a system of interlinked hypertext documents accessed via the Internet, a global system of interconnected computer networks serving billions of users worldwide. The huge number of documents on the Web is challenging for web search engines, since the Web contains multiple copies of the same content: many pages are duplicates or near duplicates of other pages. Such pages cause substantial problems for search engines: they enlarge the space required to store the index, increase the cost of serving results, and frustrate users. To help search engines return redundancy-free results and present distinct, useful results on the first page, duplicate and near-duplicate detection is required. The proposed approach detects near-duplicate web pages to increase the search effectiveness and storage efficiency of a search engine.
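The abstract does not describe NDupDet's internals, but near-duplicate detection is commonly illustrated with word shingling and Jaccard similarity: two pages are flagged when the overlap between their shingle sets exceeds a threshold. The sketch below is a minimal generic illustration of that idea, not the NDupDet algorithm itself; the function names and the threshold value are assumptions for demonstration.

```python
# Illustrative sketch only: the abstract gives no details of NDupDet,
# so this shows a common generic technique (word shingling + Jaccard
# similarity) often used for near-duplicate web page detection.

def shingles(text, k=3):
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(docs, threshold=0.7, k=3):
    """Return index pairs of documents whose shingle overlap meets the threshold."""
    sets = [shingles(d, k) for d in docs]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

For large collections, the pairwise comparison above is replaced in practice by sketching schemes such as MinHash or SimHash, which let an engine find candidate pairs without comparing every pair of pages.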
Similar resources
Identification of Duplicate News Stories in Web Pages
Identifying near duplicate documents is a challenge often faced in the field of information discovery. Unfortunately many algorithms that find near duplicate pairs of plain text documents perform poorly when used on web pages, where metadata and other extraneous information make that process much more difficult. If the content of the page (e.g., the body of a news article) can be extracted from...
A Near-duplicate Detection Algorithm to Facilitate Document Clustering
Web mining faces huge problems due to duplicate and near-duplicate web pages. Detecting near duplicates is very difficult in a collection as large as the Internet. The presence of these pages plays an important role in performance degradation when integrating data from heterogeneous sources, and they increase both index storage space and serving costs. Detecting t...
An Efficient Approach for Near-duplicate Page Detection in Web Crawling
The drastic development of the World Wide Web in the recent times has made the concept of Web Crawling receive remarkable significance. The voluminous amounts of web documents swarming the web have posed huge challenges to the web search engines making their results less relevant to the users. The presence of duplicate and near duplicate web documents in abundance has created additional overhea...
Duplicate Web Pages Detection with the Support of 2D Table Approach
Duplicate and near-duplicate web pages hamper the operation of search engines. As a consequence of such pages, a common issue for search engines is the growth of indexed storage; this higher storage demand slows processing, which in turn increases serving costs. Finally, duplication arises while gathering the required data from the var...
Optimization of Search Results with Duplicate Page Elimination using Usage Data
The performance and scalability of search engines are greatly affected by the presence of enormous amount of duplicate data on the World Wide Web. The flooded search results containing a large number of identical or near identical web pages affect the search efficiency and seek time of the users to find the desired information within the search results. When navigating through the results, the ...
Publication date: 2013